microservice system
Root Cause Analysis for Microservice Systems via Cascaded Conditional Learning with Hypergraphs
Xie, Shuaiyu, He, Hanbin, Wang, Jian, Li, Bing
Root cause analysis in microservice systems typically involves two core tasks: root cause localization (RCL) and failure type identification (FTI). Despite substantial research efforts, conventional diagnostic approaches still face two key challenges. First, these methods predominantly adopt a joint learning paradigm for RCL and FTI to exploit shared information and reduce training time. Second, these existing methods primarily focus on point-to-point relationships between instances, overlooking the group nature of inter-instance influences induced by deployment configurations and load balancing. To overcome these limitations, we propose CCLH, a novel root cause analysis framework that orchestrates diagnostic tasks based on cascaded conditional learning. CCLH provides a three-level taxonomy for group influences between instances and incorporates a heterogeneous hypergraph to model these relationships, facilitating the simulation of failure propagation. Extensive experiments conducted on datasets from three microservice benchmarks demonstrate that CCLH outperforms state-of-the-art methods in both RCL and FTI.
Microservice architecture has been widely adopted by cloud-native enterprises due to its flexibility, scalability, and loose coupling. In microservice systems (MSS), each microservice typically runs as multiple instances, which collaborate with instances affiliated with other microservices to handle user requests [1], [2]. As these systems scale up, they may suffer from reliability issues, also known as failures, attributable to their increasing complexity and dynamicity. Worse still, diagnosing failures in microservice systems is labor-intensive and time-consuming, due to the intricate failure propagation and the overwhelming volume of telemetry data. For example, GitHub once took approximately one and a half hours to resolve a failure that disrupted the codespace service, affecting millions of developers and repositories [3].
Traditional root cause analysis (RCA) in MSS encompasses two tasks: root cause localization (RCL) and failure type identification (FTI).
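The group influences described in the abstract above can be pictured with a small typed hypergraph. The following Python sketch is only an illustration under stated assumptions: the instance names, the two hyperedge types ("service" and "host"), and the mean-based propagation rule are hypothetical and are not taken from CCLH, whose three-level taxonomy is not spelled out here.

```python
from collections import defaultdict

# Hypothetical instances: the service each belongs to and the host it runs on.
INSTANCES = {
    "cart-0": {"service": "cart", "host": "node-a"},
    "cart-1": {"service": "cart", "host": "node-b"},
    "pay-0": {"service": "pay", "host": "node-a"},
}


def build_hyperedges(instances):
    """Group instances into typed hyperedges keyed by (edge_type, group_name).

    Groups with a single member are dropped because they carry no group influence.
    """
    groups = defaultdict(set)
    for name, attrs in instances.items():
        for edge_type, group in attrs.items():
            groups[(edge_type, group)].add(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


def propagate(scores, hyperedges):
    """One assumed propagation step: node -> hyperedge mean -> node mean."""
    edge_mean = {
        key: sum(scores[m] for m in members) / len(members)
        for key, members in hyperedges.items()
    }
    updated = {}
    for name, score in scores.items():
        means = [edge_mean[key] for key, members in hyperedges.items() if name in members]
        # Instances outside every hyperedge keep their own score.
        updated[name] = sum(means) / len(means) if means else score
    return updated


edges = build_hyperedges(INSTANCES)
spread = propagate({"cart-0": 1.0, "cart-1": 0.0, "pay-0": 0.0}, edges)
```

In this toy setup an anomaly score on `cart-0` reaches both `cart-1` (same service) and `pay-0` (same host) in a single step, which a purely pairwise call graph between `cart` and `pay` would not necessarily capture.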
MicroRemed: Benchmarking LLMs in Microservices Remediation
Zhang, Lingzhe, Zhai, Yunpeng, Jia, Tong, Duan, Chiming, He, Minghua, Pan, Leyi, Liu, Zhaoyang, Ding, Bolin, Li, Ying
Large Language Models (LLMs) integrated with agent-based reasoning frameworks have recently shown strong potential for autonomous decision-making and system-level operations. One promising yet underexplored direction is microservice remediation, where the goal is to automatically recover faulty microservice systems. Existing approaches, however, still rely on human-crafted prompts from Site Reliability Engineers (SREs), with LLMs merely converting textual instructions into executable code. To advance research in this area, we introduce MicroRemed, the first benchmark for evaluating LLMs in end-to-end microservice remediation, where models must directly generate executable Ansible playbooks from diagnosis reports to restore system functionality. We further propose ThinkRemed, a multi-agent framework that emulates the reflective and perceptive reasoning of SREs. Experimental results show that MicroRemed presents substantial challenges to current LLMs, while ThinkRemed improves end-to-end remediation performance through iterative reasoning and system reflection. The benchmark is available at https://github.com/LLM4AIOps/MicroRemed.
Adaptive Root Cause Localization for Microservice Systems with Multi-Agent Recursion-of-Thought
Zhang, Lingzhe, Jia, Tong, Wang, Kangjin, Hong, Weijie, Duan, Chiming, He, Minghua, Li, Ying
As contemporary microservice systems become increasingly popular and complex, often comprising hundreds or even thousands of fine-grained, interdependent subsystems, they are facing more frequent failures. Ensuring system reliability thus demands accurate root cause localization. While traces and metrics have proven to be effective data sources for this task, existing methods either heavily rely on pre-defined schemas, which struggle to adapt to evolving operational contexts, or lack interpretability in their reasoning process, thereby leaving Site Reliability Engineers (SREs) confused. In this paper, we conduct a comprehensive study on how SREs localize the root cause of failures, drawing insights from multiple professional SREs across different organizations. Our investigation reveals that human root cause analysis exhibits three key characteristics: recursiveness, multi-dimensional expansion, and cross-modal reasoning. Motivated by these findings, we introduce RCLAgent, an adaptive root cause localization method for microservice systems that leverages a multi-agent recursion-of-thought framework. RCLAgent employs a novel recursion-of-thought strategy to guide the LLM's reasoning process, effectively integrating data from multiple agents and tool-assisted analysis to accurately pinpoint the root cause. Experimental evaluations on various public datasets demonstrate that RCLAgent achieves superior performance by localizing the root cause using only a single request, outperforming state-of-the-art methods that depend on aggregating multiple requests. These results underscore the effectiveness of RCLAgent in enhancing the efficiency and precision of root cause localization in complex microservice environments.
Autonomous Resource Management in Microservice Systems via Reinforcement Learning
Zou, Yujun, Qi, Nia, Deng, Yingnan, Xue, Zhihao, Gong, Ming, Zhang, Wuyang
This paper proposes a reinforcement learning-based method for microservice resource scheduling and optimization, aiming to address issues such as uneven resource allocation, high latency, and insufficient throughput in traditional microservice architectures. In microservice systems, as the number of services and the load increase, efficiently scheduling and allocating resources such as computing power, memory, and storage becomes a critical research challenge. To address this, the paper employs an intelligent scheduling algorithm based on reinforcement learning. Through the interaction between the agent and the environment, the resource allocation strategy is continuously optimized. In the experiments, the paper considers different resource conditions and load scenarios, evaluating the proposed method across multiple dimensions, including response time, throughput, resource utilization, and cost efficiency. The experimental results show that the reinforcement learning-based scheduling method significantly improves system response speed and throughput under low load and high concurrency conditions, while also optimizing resource utilization and reducing energy consumption. Under multi-dimensional resource conditions, the proposed method can consider multiple objectives and achieve optimized resource scheduling. Compared to traditional static resource allocation methods, the reinforcement learning model demonstrates stronger adaptability and optimization capability. It can adjust resource allocation strategies in real time, thereby maintaining good system performance in dynamically changing load and resource environments.
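The agent-environment loop mentioned in the abstract above can be sketched with tabular Q-learning. This is an assumption made for illustration only: the abstract does not name the algorithm, and the state (replica count), actions, reward weights, and hyperparameters below are all hypothetical.

```python
import random

ACTIONS = (-1, 0, 1)  # remove a replica, hold, add a replica


def reward(replicas, load):
    """Hypothetical reward: penalize unmet load heavily and idle replicas lightly."""
    unmet = max(0, load - replicas)
    idle = max(0, replicas - load)
    return -(2.0 * unmet + 0.5 * idle)


def q_update(q, state, action, r, next_state, alpha=0.5, gamma=0.9):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (r + gamma * best_next - old)
    return q[(state, action)]


def train(load=3, max_replicas=5, episodes=200, steps=20, epsilon=0.2, seed=0):
    """Epsilon-greedy training loop over a toy environment with a fixed load."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        replicas = rng.randint(1, max_replicas)
        for _ in range(steps):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = max(ACTIONS, key=lambda a: q.get((replicas, a), 0.0))  # exploit
            nxt = min(max_replicas, max(1, replicas + action))
            q_update(q, replicas, action, reward(nxt, load), nxt)
            replicas = nxt
    return q


q_table = train()
```

A real scheduler would observe a much richer state (CPU, memory, latency, cost) and would likely need function approximation rather than a table; the sketch only shows how interaction with the environment updates the allocation policy.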
Enabling Autonomic Microservice Management through Self-Learning Agents
Yu, Fenglin, Yang, Fangkai, Qin, Xiaoting, Zhang, Zhiyang, Zhang, Jue, Lin, Qingwei, Zhang, Hongyu, Dang, Yingnong, Rajmohan, Saravan, Zhang, Dongmei, Zhang, Qi
The increasing complexity of modern software systems necessitates robust autonomic self-management capabilities. While Large Language Models (LLMs) demonstrate potential in this domain, they often face challenges in adapting their general knowledge to specific service contexts. To address this limitation, we propose ServiceOdyssey, a self-learning agent system that autonomously manages microservices without requiring prior knowledge of service-specific configurations. By leveraging curriculum learning principles and iterative exploration, ServiceOdyssey progressively develops a deep understanding of operational environments, reducing dependence on human input or static documentation. A prototype built with the Sock Shop microservice demonstrates the potential of this approach for autonomic microservice management.
Are GNNs Effective for Multimodal Fault Diagnosis in Microservice Systems?
Gao, Fei, Xin, Ruyue, Zhang, Yaqiang
Fault diagnosis in microservice systems has increasingly embraced multimodal observation data for a holistic and multifaceted view of the system, with Graph Neural Networks (GNNs) commonly employed to model complex service dependencies. However, despite the intuitive appeal, there remains a lack of compelling justification for the adoption of GNNs, as no direct evidence supports their necessity or effectiveness. To critically evaluate the current use of GNNs, we propose DiagMLP, a simple topology-agnostic baseline as a substitute for GNNs in fault diagnosis frameworks. Through experiments on five public datasets, we surprisingly find that DiagMLP performs competitively with and even outperforms GNN-based methods in fault diagnosis tasks, indicating that the current paradigm of using GNNs to model service dependencies has not yet demonstrated a tangible contribution. We further discuss potential reasons for this observation and advocate shifting the focus from solely pursuing novel model designs to developing challenging datasets, standardizing preprocessing protocols, and critically evaluating the utility of advanced deep learning modules.
Online Multi-modal Root Cause Analysis
Zheng, Lecheng, Chen, Zhengzhang, Chen, Haifeng, He, Jingrui
Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems. Traditional data-driven RCA methods are typically limited to offline applications due to high computational demands, and existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems. In this paper, we introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization. OCEAN employs a dilated convolutional neural network to capture long-term temporal dependencies and graph neural networks to learn causal relationships among system entities and key performance indicators. We further design a multi-factor attention mechanism to analyze and reassess the relationships among different metrics and log indicators/attributes for enhanced online causal graph learning. Additionally, a contrastive mutual information maximization-based graph fusion module is developed to effectively model the relationships across various modalities. Extensive experiments on three real-world datasets demonstrate the effectiveness and efficiency of our proposed method. Root Cause Analysis (RCA) is crucial for identifying the underlying causes of system failures and ensuring the high performance of microservice systems (Wang et al., 2023a; Li et al., 2021; Wang et al., 2023c).
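To make the role of dilation in capturing long-term temporal dependencies concrete, here is a minimal pure-Python sketch of a dilated 1-D convolution. It assumes a causal, zero-padded variant and a hand-picked kernel; the abstract does not describe OCEAN's actual layer configuration, so every parameter here is illustrative.

```python
def dilated_causal_conv1d(series, kernel, dilation):
    """output[t] = sum_i kernel[i] * series[t - i * dilation], with zero padding on the left."""
    out = []
    for t in range(len(series)):
        total = 0.0
        for i, weight in enumerate(kernel):
            idx = t - i * dilation
            if idx >= 0:  # positions before the start of the series count as zero
                total += weight * series[idx]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input steps visible to one output of a stack of dilated layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


smoothed = dilated_causal_conv1d([1, 2, 3, 4, 5], kernel=[1, 1], dilation=2)
span = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8])
```

With a kernel of size 2 and dilations doubling per layer, four layers already see 16 time steps, which is why stacking dilated layers covers long histories at low cost.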
System States Forecasting of Microservices with Dynamic Spatio-Temporal Data
Xu, Yifei, Ge, Jingguo, Tang, Haina, Ding, Shuai, Li, Tong, Li, Hui
In the AIOps (Artificial Intelligence for IT Operations) era, accurately forecasting system states is crucial. In microservices systems, this task encounters the challenge of dynamic and complex spatio-temporal relationships among microservice instances, primarily due to dynamic deployments, diverse call paths, and cascading effects among instances. Current time-series forecasting methods, which focus mainly on intrinsic patterns, are insufficient in environments where spatial relationships are critical. Similarly, spatio-temporal graph approaches often neglect the nature of temporal trends, concentrating mostly on message passing between nodes. Moreover, current research in the microservices domain frequently underestimates the importance of network metrics and topological structures in capturing the evolving dynamics of systems. This paper introduces STMformer, a model tailored for forecasting system states in microservices environments, capable of handling multi-node and multivariate time series. Our method leverages dynamic network connection data and topological information to assist in modeling the intricate spatio-temporal relationships within the system. Additionally, we integrate the PatchCrossAttention module to compute the impact of cascading effects globally. We have developed a dataset based on a microservices system and conducted comprehensive experiments with STMformer against leading methods. In both short-term and long-term forecasting tasks, our model consistently achieved an 8.6% reduction in MAE (Mean Absolute Error) and a 2.2% reduction in MSE (Mean Squared Error). The source code is available at https://github.com/xuyifeiiie/STMformer.
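For readers unfamiliar with the two error metrics reported above, the following Python sketch shows their standard definitions. The `relative_reduction` helper is an assumption about how a percentage reduction against a baseline might be computed; the abstract does not state whether its figures are relative or absolute.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average of |truth - prediction|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: average of (truth - prediction)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline_error, model_error):
    """Assumed convention: fraction by which the model lowers the baseline error."""
    return (baseline_error - model_error) / baseline_error


example_mae = mae([2, 4], [1, 6])
example_mse = mse([2, 4], [1, 6])
example_gain = relative_reduction(2.0, 1.5)
```

Because MSE squares each error, it weights large misses more heavily than MAE, so the two metrics can improve by different amounts for the same model.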
A Scenario-Oriented Benchmark for Assessing AIOps Algorithms in Microservice Management
Sun, Yongqian, Wang, Jiaju, Li, Zhengdan, Nie, Xiaohui, Ma, Minghua, Zhang, Shenglin, Ji, Yuhe, Zhang, Lu, Long, Wen, Chen, Hengmao, Luo, Yongnan, Pei, Dan
AIOps algorithms play a crucial role in the maintenance of microservice systems. The performance leaderboards of many previous benchmarks provide valuable guidance for selecting appropriate algorithms. However, existing AIOps benchmarks mainly utilize offline datasets to evaluate algorithms. They cannot consistently evaluate the performance of algorithms using real-time datasets, and the operation scenarios for evaluation are static, which is insufficient for effective algorithm selection. To address these issues, we propose an evaluation-consistent and scenario-oriented evaluation framework named MicroServo. The core idea is to build a live microservice benchmark to generate real-time datasets and consistently simulate the specific operation scenarios on it. MicroServo supports different leaderboards by selecting specific algorithms and datasets according to the operation scenarios. It also supports the deployment of various types of algorithms, enabling algorithm hot-plugging. Finally, we test MicroServo with three typical microservice operation scenarios to demonstrate its efficiency and usability.
CHASE: A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems
Zhao, Ziming, Zhang, Tiehua, Shen, Zhishu, Dong, Hai, Ma, Xingjun, Liu, Xianhui, Yang, Yun
In recent years, the widespread adoption of distributed microservice architectures within the industry has significantly increased the demand for enhanced system availability and robustness. Due to the complex service invocation paths and dependencies in enterprise-level microservice systems, it is challenging to locate anomalies promptly during service invocations, thus causing intractable issues for normal system operations and maintenance. In this paper, we propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data, including traces, logs, and system monitoring metrics. Specifically, related information is encoded into representative embeddings and further modeled by a multimodal invocation graph. Following that, anomaly detection is performed on each instance node with attentive heterogeneous message passing from its adjacent metric and log nodes. Finally, CHASE learns from the constructed hypergraph with hyperedges representing the flow of causality and performs root cause localization. We evaluate the proposed framework on two public microservice datasets with distinct attributes and compare it with state-of-the-art methods. The results show that CHASE achieves average performance gains of up to 36.2% (A@1) and 29.4% (Percentage@1), respectively, over its best counterpart.
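The A@1 figure above is a top-k ranking metric. The sketch below assumes the definition commonly used in root cause localization work, where A@k is the fraction of failure cases whose true root cause appears among the top k ranked candidates; the abstract itself does not define the metric, and the rankings shown are made-up examples.

```python
def accuracy_at_k(rankings, truths, k):
    """Fraction of cases whose true root cause is within the top-k ranked candidates."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)


# Three hypothetical failure cases, each with a ranked candidate list; "a" is always the culprit.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truths = ["a", "a", "a"]

a_at_1 = accuracy_at_k(rankings, truths, 1)
a_at_3 = accuracy_at_k(rankings, truths, 3)
```

A@1 is the strictest setting, since only the first-ranked candidate counts as a hit.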